Estimating Prevalence Correctly

Complex Sampling in National Surveys

Dr Mohd Azmi Bin Suliman

Pusat Penyelidikan Penyakit Tak Berjangkit, Institut Kesihatan Umum

Sunday, 16 November 2025

Institut Kesihatan Umum (IKU)

  • Conducts the National Health and Morbidity Survey (NHMS).
  • Generates national evidence on population health and NCDs.
  • Supports policy and programme planning with reliable data.

From Population to Sample

  • We study populations through samples, not by measuring everyone.
  • Samples may differ in age, sex, or ethnicity distribution.
  • National surveys must ensure representativeness, not just inclusion.

Malaysian Population, 2025

  • Data: Department of Statistics Malaysia (DOSM)
  • Shows Malaysia’s age–sex structure.

What is Complex Sampling?

  • Stratification ensures key subgroups are represented.
  • Clustering groups respondents for logistical efficiency.
  • Unequal selection probabilities are corrected with weights.
  • Requires design-based analysis for valid inference.
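
A minimal sketch of how weights undo unequal selection probabilities: a respondent's base (design) weight is the inverse of the overall probability of selection. The two-stage probabilities below are hypothetical illustration values, not NHMS design parameters, and `design_weight` is a name introduced here for illustration.

```python
def design_weight(p_cluster: float, p_within: float) -> float:
    """Base weight for a two-stage design: 1 / (overall selection probability)."""
    p_overall = p_cluster * p_within
    if not 0.0 < p_overall <= 1.0:
        raise ValueError("selection probabilities must lie in (0, 1]")
    return 1.0 / p_overall


# Hypothetical values: a cluster is selected with probability 0.05, and a
# household within a selected cluster with probability 0.20.
w = design_weight(0.05, 0.20)
print(f"Each respondent stands for about {w:.0f} people")
```

Respondents with a lower chance of selection receive larger weights, so they count for more in the estimate.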

Why Use Complex Sampling?

  • Covers diverse populations efficiently.
  • Reduces cost and fieldwork burden.
  • Enables accurate national and state estimates.

Example (NHMS 2023) – Diabetes Prevalence

Age Group    % with DM    95% CI
18–29        3.2          2.2–4.6
30–39        6.5          5.2–8.1
40–49        15.2         13.2–17.4
50–59        28.8         25.0–33.0
60+          38.0         35.4–40.7

Simulation Dataset

  • 1,100 synthetic respondents.
  • Age groups: 18–29 to 60+.
  • Sex ratio 40% male / 60% female.
  • Ethnicity: 65% Malay, 20% Chinese, 15% Indian.
  • Diabetes prevalence ≈ 22%.
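
The sample composition can be sketched as cell counts. The age-group sizes below are not stated on this slide; they are inferred from the crude results later in the deck (for example, 8 cases at 4.0% implies 200 respondents aged 18–29). Applying the sex and ethnicity shares independently within each age group is an assumption, although it reproduces the sample counts shown in the post-stratification table.

```python
# Age-group sizes: an inference from the crude prevalence table, not a stated fact.
age_n = {"18-29": 200, "30-39": 200, "40-49": 200, "50-59": 200, "60+": 300}
sex_share = {"male": 0.40, "female": 0.60}
eth_share = {"malay": 0.65, "chinese": 0.20, "indian": 0.15}

# Expected respondents per age x sex x ethnicity cell (assumes independence).
cells = {
    (age, sex, eth): round(n * p_sex * p_eth)
    for age, n in age_n.items()
    for sex, p_sex in sex_share.items()
    for eth, p_eth in eth_share.items()
}

print(sum(cells.values()))                 # total respondents
print(cells[("18-29", "male", "malay")])   # one cell, for comparison with the table
```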

Crude Prevalence

Characteristic    N = 1,100¹
dm
    No DM         858 (78.0%)
    DM            242 (22.0%)
¹ n (%)
  • Crude prevalence ≈ 22%.
  • Does this represent Malaysia accurately?
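
The crude figure is a simple proportion. The sketch below also attaches a confidence interval under a simple-random-sampling assumption (normal approximation), which is exactly the assumption a complex design does not satisfy; it is shown only as the naive baseline.

```python
import math

n_dm, n_total = 242, 1100          # counts from the crude table
p_crude = n_dm / n_total           # unweighted prevalence

# Naive standard error: assumes every respondent was drawn independently
# with equal probability (simple random sampling).
se_srs = math.sqrt(p_crude * (1 - p_crude) / n_total)
ci_low = p_crude - 1.96 * se_srs
ci_high = p_crude + 1.96 * se_srs

print(f"Crude prevalence: {p_crude:.1%} (naive 95% CI {ci_low:.1%} to {ci_high:.1%})")
```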

Sample vs Population

  • The sample over-represents older and female respondents relative to the population.

Why Adjust Weights?

  • To match sample structure to true population totals.
  • Ensures estimates represent Malaysia accurately.

Post-Stratification

  • Adjusts weights so weighted totals match population totals by age, sex, and ethnicity.
  • Corrects for over- and under-representation of subgroups in the sample.
Age Group  Sex     Ethnicity  Sample Count (n)  Init. Est. Pop.  Malaysia Population ('000)  Post-strat Factor
18-29      male    malay      52                1040             1910.68                     1.8371923
18-29      male    indian     12                240              201.46                      0.8394167
18-29      female  malay      78                1560             1790.28                     1.1476154
18-29      female  indian     18                360              188.38                      0.5232778
40-49      male    malay      52                1040             1232.30                     1.1849038
40-49      male    indian     12                240              161.50                      0.6729167
40-49      female  malay      78                1560             1203.60                     0.7715385
40-49      female  indian     18                360              155.50                      0.4319444
60+        male    malay      78                1560             982.90                      0.6300641
60+        male    indian     18                360              129.20                      0.3588889
60+        female  malay      117               2340             1064.90                     0.2770370
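
The factors in the table above can be reproduced as population total divided by initial estimated population. In the sketch below, the base weight of 20 is an inference from the table (each initial estimate equals 20 times the sample count), and `post_strat_factor` is a name introduced here for illustration.

```python
BASE_WEIGHT = 20  # assumption: inferred from the table, est_pop = 20 * n


def post_strat_factor(n_sample: int, pop_thousands: float) -> float:
    """Population total ('000) divided by the initial weighted estimate."""
    est_pop = BASE_WEIGHT * n_sample
    return pop_thousands / est_pop


# (age group, sex, ethnicity, sample count, Malaysia population in '000)
rows = [
    ("18-29", "male", "malay", 52, 1910.68),
    ("18-29", "female", "indian", 18, 188.38),
    ("60+", "female", "malay", 117, 1064.90),
]

for age, sex, eth, n, pop_k in rows:
    f = post_strat_factor(n, pop_k)
    direction = "up-weighted" if f > 1 else "down-weighted"
    print(f"{age:6} {sex:7} {eth:7} factor = {f:.7f} ({direction})")
```

A factor above 1 means the cell is under-represented in the sample relative to the population; a factor below 1 means it is over-represented.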

Post-Strat Effect – Age

  • Younger adults up-weighted (f > 1).
  • Older adults down-weighted (f < 1).
gender  ethnicity  age_group  n_samp  est_pop  pop_k    pop_fct
male    malay      18-29      52      1040     1910.68  1.8371923
male    malay      30-39      52      1040     1419.40  1.3648077
male    malay      40-49      52      1040     1232.30  1.1849038
male    malay      50-59      52      1040     814.80   0.7834615
male    malay      60+        78      1560     982.90   0.6300641
female  malay      18-29      78      1560     1790.28  1.1476154
female  malay      30-39      78      1560     1419.10  0.9096795
female  malay      40-49      78      1560     1203.60  0.7715385
female  malay      50-59      78      1560     828.40   0.5310256
female  malay      60+        117     2340     1064.90  0.4550855

Post-Strat Effect – Gender

  • Males under-sampled → weight ↑.
  • Females over-sampled → weight ↓.
gender  ethnicity  age_group  n_samp  est_pop  pop_k    pop_fct
male    malay      18-29      52      1040     1910.68  1.8371923
female  malay      18-29      78      1560     1790.28  1.1476154
male    malay      40-49      52      1040     1232.30  1.1849038
female  malay      40-49      78      1560     1203.60  0.7715385
male    malay      60+        78      1560     982.90   0.6300641
female  malay      60+        117     2340     1064.90  0.4550855

Before and After Weighting

  • Weighting restores population structure.

Corrected Prevalence (Weighted)

Comparison of Crude and Weighted Estimates of Diabetes Prevalence

Characteristic    Crude (Unweighted)    Weighted (Post-Stratified)
                  DM, N = 242¹          DM, N = 3,269¹
my
    Overall       242 (22.0%)           3,269 (16.7%)
gender
    female        147 (22.3%)           1,672 (17.2%)
    male          95 (21.6%)            1,597 (16.2%)
age_group
    18-29         8 (4.0%)              211 (4.0%)
    30-39         14 (7.0%)             284 (6.8%)
    40-49         34 (17.0%)            609 (15.8%)
    50-59         63 (31.5%)            816 (29.5%)
    60+           123 (41.0%)           1,349 (38.8%)
¹ n (%)
  • Weighted estimates reflect the population's age, sex, and ethnicity structure.
  • Crude estimates reflect only the sample.
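
A weighted prevalence is the weighted share of cases: the sum of weights among people with DM divided by the sum of all weights. The sketch below uses five hypothetical respondents (not the simulation data) to show how the same responses can give different crude and weighted answers; `weighted_prevalence` is a name introduced here.

```python
def weighted_prevalence(has_dm: list[int], weights: list[float]) -> float:
    """Sum of weights among cases divided by the sum of all weights."""
    if len(has_dm) != len(weights):
        raise ValueError("has_dm and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * y for y, w in zip(has_dm, weights)) / total


# Hypothetical toy data: the two cases come from an over-sampled group
# (weight 0.5), while most non-cases come from under-sampled groups.
has_dm = [1, 0, 0, 1, 0]
weights = [0.5, 2.0, 2.0, 0.5, 1.0]

crude = sum(has_dm) / len(has_dm)
weighted = weighted_prevalence(has_dm, weights)
print(f"crude = {crude:.1%}, weighted = {weighted:.1%}")
```

The same mechanism explains the table: the sample over-represents the 60+ group, where DM is most common, so down-weighting that group lowers the overall estimate.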

Caveats in Complex Sampling

  • Needs a known sampling frame.
  • Requires a larger sample to offset the design effect.
  • Intra-cluster correlation reduces precision.
  • Standard tests that ignore the weights and design give invalid results.
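
To illustrate how clustering costs precision, the sketch below uses a common approximation for equal-sized clusters (Kish's formula): design effect = 1 + (m − 1) × ICC, where m is the number of respondents per cluster. The formula is not taken from these slides, and the cluster size and ICC values are hypothetical.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Approximate design effect from clustering: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n: int, deff: float) -> float:
    """Size of a simple random sample with the same precision."""
    return n / deff


# Hypothetical values: 20 respondents per cluster, intra-cluster correlation 0.05.
deff = design_effect(cluster_size=20, icc=0.05)
n_eff = effective_sample_size(1100, deff)
print(f"design effect = {deff:.2f}, effective sample size = {n_eff:.0f}")
```

With these illustrative values, 1,100 clustered respondents carry roughly the information of a much smaller simple random sample, which is why complex surveys need larger samples.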

Summary

  • Complex sampling improves representativeness.
  • Weighting corrects for unequal selection.
  • Post-stratification aligns sample to population.
  • Corrected estimates are valid and comparable nationally.